Data block-based system and methods for predictive models

ABSTRACT

Systems and methods are provided for recording information at a granular level; checking and verifying that the way data is used and processed is consistent with an entity's internal policies and/or external regulations; and producing reports for authorized users (e.g., individuals and organizations) with information related to the recording and checking. The systems and methods capture required data in an immutable fashion so that users outside of an entity (e.g., public, third parties) can check and audit that internal policies and other regulatory policies and frameworks are followed.

This application is a continuation of U.S. application Ser. No. 17/990,601, filed Nov. 18, 2022, for DATA BLOCK-BASED SYSTEM AND METHODS FOR PREDICTIVE MODELS, which is a continuation of U.S. application Ser. No. 16/855,027, filed Apr. 22, 2020, for DATA BLOCK-BASED SYSTEM AND METHODS FOR PREDICTIVE MODELS, both of which are incorporated in their entirety herein by reference.

FIELD OF THE DISCLOSURE

The present disclosure relates to cybersecurity, fraud and risk systems. More particularly, the disclosure relates to systems, methods and architecture for the collection and use of data, and the processing of data for analytic models and workflows.

BACKGROUND OF THE DISCLOSURE

Building and deploying cybersecurity, fraud and risk models using conventional systems and methods presents a number of challenges in trying to ensure that relevant data sharing policies, as well as all regulatory requirements, are followed. Applying known machine learning techniques to build and deploy fraud and risk models requires collecting the required data, cleaning and transforming the data to build the required features, either directly through feature engineering or indirectly through deep learning, estimating the parameters of the models, and then deploying the models into operational systems to process and score the data, either in batch or in near real time. There are policies and regulatory requirements about what data is collected and how users are informed, what data is shared and with whom, what data is used to build models, what data is used as the inputs to models to produce scores, and what actions are taken by systems based upon the scores. Users whose data is collected, members of data sharing consortia, customers that buy collected or processed data, customers that buy models or scores, and users whose interactions with systems are determined in part by models and scores each have an interest in verifying claims that data, models, scores and actions are all compliant with relevant policies and regulatory requirements. This task is quite difficult given how data is collected in conventional systems, in conjunction with how models are typically built and integrated into user facing systems.

Conventional systems and methods have three primary disadvantages. First, regarding collection and use of data, data collection, data access, and data use for analytic models are not currently captured in a way that provides a complete custody or “provenance chain.” Second, regarding processing of data for analytic models and workflows, methods of cleaning, processing and aggregating data for fraud and risk analysis are not captured in a way that provides a complete custody or provenance chain. This data may be needed to produce “features” or used as inputs to models or workflows and is subject to audit or to continuous checking of regulatory requirements, privacy rules, or data sharing rules. Third, regarding inspection and auditing of data use and data processing, conventional methods and systems rely on manual creation and maintenance of reports of the data flows that collect data and the ways that data models and scores are used by internal processes and systems. These reports further rely on manual updates as different data sources are used or as the system is changed. These reports are then used when regulatory disclosures, audit reports and similar reporting are required. The manual nature of such reporting introduces time delay and is also a source of error that will propagate throughout the system. In practice, it is common that models and workflows are changed but the documentation that is used for compliance purposes is not. This leads to a gap between the documentation, which is consistent with the policies and regulations, and the model, which has drifted from the documentation and is therefore no longer compliant with the policies and regulations.

BRIEF SUMMARY OF THE DISCLOSURE

Embodiments of the present invention provide systems and methods for the collection and use of data, and the processing of data for analytic models and workflows. The embodiments herein log each time data is collected, accessed or processed. In an embodiment, a system for data sharing with a plurality of users, from an organization or a consortium of organizations, and checking rule compliance is provided. The system comprises: a block-based storage system containing data blocks; a first module coupled to the block-based storage system for creating and reading the data blocks; a second module adapted and configured to manage at least one of logging of data collection, data access by at least one of the plurality of users, data access by the system, and an execution of workflows; and a third module adapted and configured to ensure the system is compliant with a plurality of rules.

In an embodiment, the data blocks within the block-based storage system comprise at least one of data storage blocks and data provenance blocks. In an embodiment, the plurality of rules are at least one of data collection rules, data sharing rules, privacy rules and regulatory requirements. In an embodiment, the first module is adapted and configured to be a centralized ledger. In an embodiment, the plurality of users are all from the same organization or the same consortium of organizations. In an embodiment, the third module is adapted and configured to enable at least one of the plurality of users from within the organization or the consortium of organizations to check whether data collection is compliant with at least one of the plurality of rules. In an embodiment, the third module is adapted and configured to provide at least one of the plurality of users, outside of the organization or the consortium of organizations, with confirmation of whether at least one of access to and processing of the data by the system, data sharing, building of models, use of scores and other outputs from models and workflows is compliant with at least one of the plurality of rules. In an embodiment, the first module is adapted and configured to use a block chain. In an embodiment, the third module is adapted and configured so that at least one of the plurality of users is outside of an organization or consortium of organizations. In an embodiment, the third module is adapted and configured to enable at least one of the plurality of users within an organization or the consortium of organizations to check whether data collection is compliant with at least one of the plurality of rules. In an embodiment, the third module is adapted and configured to enable at least one of the plurality of users, outside of the organization or the consortium of organizations, to check whether data collection is compliant with at least one of the plurality of rules. In an embodiment, the third module is adapted and configured to provide at least one of the plurality of users, from within the organization or the consortium of organizations, with confirmation of whether at least one of access to and processing of the data blocks by the system, data sharing, building of models, use of scores and other outputs from models and workflows is consistent with at least one of the plurality of rules. In an embodiment, the third module is adapted and configured to provide at least one of the plurality of users outside of the organization or the consortium of organizations with confirmation of whether at least one of access to and processing of the data blocks by the system, data sharing, building of models, use of scores and other outputs from models and workflows is compliant with at least one of the plurality of rules.

In an embodiment, the second module is adapted and configured to check that data collection is consistent with at least one of the plurality of rules. In an embodiment, the second module is adapted and configured to check that the use of scores and other outputs from models and workflows is compliant with at least one of the plurality of rules. In an embodiment, the second module is adapted and configured to check that access to and processing of the data modules by the system is compliant with the at least one of the plurality of rules. In an embodiment, the second module is adapted and configured to check whether at least one of data sharing and building of models is consistent with at least one of the plurality of rules. In an embodiment, a log of access to the data blocks by the system is saved to the first module. In an embodiment, the third module is adapted and configured to provide at least one of the plurality of users outside the organization or the consortium of organizations with confirmation of whether data collection is compliant with at least one of the plurality of rules. In an embodiment, the third module is adapted and configured to enable at least one of the plurality of users from within the organization or the consortium of organizations to check whether at least one of access to and processing of the data blocks by the system, data sharing, building of models, use of scores and other outputs from models and workflows is compliant with at least one of the plurality of rules.

These and other capabilities of the disclosed subject matter will be more fully understood after a review of the following figures, detailed description, and claims. It is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is illustrated and described herein with reference to the various drawings, in which like reference numbers are used to denote like system components, as appropriate, and in which:

FIG. 1 is a block diagram of a system according to embodiments of the present invention;

FIG. 2 is a diagram of the structure of an encrypted data block such as data storage blocks and data provenance blocks within a system according to embodiments of the present invention;

FIG. 3 illustrates a data block according to embodiments of the present invention;

FIG. 4 illustrates how the Data and Model Provenance Block (DMPB) module logs relevant data according to embodiments of the present invention;

FIG. 5 is a block diagram of a contract checker being used to access encrypted provenance blocks according to an embodiment of the present invention;

FIG. 6 illustrates the identity, authorization and access management (IAM) module for both user access and system access to the data blocks and provenance blocks according to embodiments of the present invention;

FIG. 7 illustrates a method for model and analytic scoring logging and auditing according to embodiments of the present invention; and

FIG. 8 illustrates software modules for checking that an entity's (e.g., a company, organization, or group) internal policies and external policies are being supported according to embodiments of the present invention.

DETAILED DESCRIPTION OF THE DISCLOSURE

Embodiments of the invention include i) automated processes for recording information at a granular level; ii) methods for checking/verifying that the way data is used and processed is consistent with an entity's internal policies and/or external regulations; and iii) methods for producing reports to authorized users (e.g., individuals and organizations) with information related to items i) and ii). Embodiments also include systems for capturing required data in an immutable fashion so that users outside of an entity (e.g., public, third parties) can check and audit that internal policies and other regulatory policies and frameworks are followed. These policies and frameworks may include ensuring that: i) data is collected appropriately; ii) data is appropriately processed to be used as inputs to fraud or risk models; iii) inputs are processed by fraud and risk models and workflows to produce scores (e.g., risk scores) appropriately; and iv) scores are used by fraud and risk systems appropriately. In an embodiment, a system is provided such that members of a consortium can also check and audit these four processes. A consortium may, for example, check that user location data is not used for ad targeting or that user data is not used to build risk models.

An example of a workflow for a risk model is when a user risk model is used to produce a score about the user's overall risk; a separate transaction risk model is used to produce a score about the risk of a particular transaction; both of the scores, plus additional inputs, are used as input to a third risk model that produces a third score that integrates the scores from the two models; and the third score is used as input to a fourth model that rescales the score and applies certain business rules, such as ignoring small dollar transactions that may unnecessarily inconvenience the user compared to the potential reduction in risk to the organization. Examples of other outputs of models beyond scores include the confidence level associated with the score and certain explanatory codes or strings that can be used to help explain to the user why the score was particularly high or low.
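By way of a non-limiting illustration, the four-model workflow described above may be sketched as follows. The function names, the model formulas, the blend weights, and the small-dollar cutoff are hypothetical assumptions chosen only to make the chaining of scores concrete; they are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ScoreOutput:
    score: float  # final rescaled risk score on a 0-100 scale
    reason: str   # explanatory string that can be shown to the user


def user_risk_model(account_age_days: int) -> float:
    # Hypothetical model 1: newer accounts are treated as riskier (result in (0, 1]).
    return 1.0 / (1.0 + account_age_days / 30.0)


def transaction_risk_model(amount: float, typical_amount: float) -> float:
    # Hypothetical model 2: risk grows as the amount exceeds typical spend, capped at 1.
    ratio = amount / max(typical_amount, 1.0)
    return min(1.0, ratio / 10.0)


def combined_model(user_score: float, txn_score: float) -> float:
    # Hypothetical model 3: integrates the two scores with assumed weights.
    return 0.4 * user_score + 0.6 * txn_score


def post_process(combined: float, amount: float, small_dollar_cutoff: float = 5.0) -> ScoreOutput:
    # Hypothetical model 4: applies a business rule, then rescales to 0-100.
    if amount < small_dollar_cutoff:
        return ScoreOutput(0.0, "small-dollar transaction ignored")
    return ScoreOutput(round(combined * 100, 1), "blended user and transaction risk")
```

In this sketch the explanatory string plays the role of the "other outputs" mentioned above; a confidence level could be carried in the same structure.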

There are several benefits and advantages of the embodiments provided herein, including, but not limited to, the following examples. Embodiments herein utilize distributed data objects (e.g., data storage blocks) in order to capture all the relevant collected and processed data at the scale and level of granularity required. Distributed data objects (e.g., provenance blocks) are provided that contain provenance information about the data, how and when it is accessed, and how and when it is processed, and “cell level access methods” are provided that ensure users are given access to precisely the data that they are authorized to view at the granularity required. In addition, analytic processing of data is expressed in workflow languages, and immutable logs of the workflows are created by the embodiments using provenance blocks, which are used to capture the internal processing of data, model inputs, model outputs, and system alerts and notifications at the scale and granularity required. Further, distributed data and provenance blocks with blockchain or centralized ledgers according to embodiments herein provide public and consortium members with access to precisely the data they are authorized to see.

FIG. 1 is an overview of a system 100 according to an embodiment of the present invention. The system 100 consists of a plurality of layers, with modules and components in layers communicating with the layers above or below using application programming interfaces (APIs). Layer 1 is a distributed data block-based secure storage layer that includes one or more data storage blocks 101 and one or more data provenance blocks 102 in a data lake (not shown). Layer 2 is an infrastructure or management module 103 coupled to Layer 1 for managing the data storage blocks 101 and data provenance blocks 102. In an embodiment, the data block system management module 103 of Layer 2 includes a Data and Model Provenance Blocks (hereinafter “DMPB”) module 103 a. In an embodiment, Layer 2 includes a centralized ledger 103 b. The management module 103 is adapted and configured to store and/or manage one or more of the following: immutable cryptographically signed logs, claims, and other assertions about user access, data access, data provenance, data processing and related events. (Note, the acronym DMPB is used throughout this disclosure to describe the data and model provenance blocks that contain information about how data is processed, and, in particular, how data is processed to produce fraud and risk models.)

In an embodiment, DMPB module 103 a uses blockchain so that members of the public who have contributed their own data and interacted with the system have a mechanism to check how their data is used by the system 100 and that this use is consistent with the required policies and regulations. Once a user registers with the system 100, the user is assigned a random string of letters and numbers (i.e., the block chain user ID) that is associated with all user related data in data storage blocks 101 and all provenance related data in data provenance blocks 102. Since the data storage blocks 101 and data provenance blocks 102 may be immutable and cannot be changed once they are written, the system 100 may provide the user with the necessary information about what data of theirs was collected and how it was used.
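As a minimal, non-limiting sketch of the user ID assignment described above, a random string of letters and numbers may be generated and used as the key that ties a user to storage and provenance records. The ID length, the function names, and the dictionary layout of a record are assumptions made for illustration only.

```python
import secrets
import string

_ALPHABET = string.ascii_letters + string.digits


def new_user_id(length: int = 32) -> str:
    """Return a random string of letters and digits to serve as a user ID (length is assumed)."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(length))


def records_for_user(blocks: list[dict], user_id: str) -> list[dict]:
    """Select every storage or provenance record tagged with the given user ID."""
    return [block for block in blocks if block.get("user_id") == user_id]
```

Because every record carries the same identifier, a lookup of this kind is sufficient to assemble what was collected about a user and how it was used.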

Layer 3 is a logging module 104 that includes an identity, authorization and access management (IAM) module 403. The logging module 104 communicates with the DMPB module in Layer 2 via API calls. The IAM module 403 provides: a) identity access management; b) role based and attribute-based access controls; c) fine-grained cell-based access controls; and d) data provenance and auditing. The IAM module 403 writes immutable cryptographically signed logs about user access, data access, data provenance, data processing and related events to Layer 2. Layer 4 is a rule (e.g., regulatory and policy) analytics module 105 that is adapted and configured to provide real-time processing and auditing, including: a) continuous checking of data sharing rules; b) continuous checking of privacy rules; c) continuous checking of regulatory requirements; and d) real-time auditing of the continuous checking of items a), b) and c). Layer 5 is a fraud and analytics module 106 that provides functionality for building and deploying risk and fraud models with data provided by Layer 1, with identity and access management provided by Layer 3, and with rules (e.g., data sharing, privacy rules, and regulatory requirements) checking provided by Layer 4.

Embodiments of the identity, authorization and access management (IAM) module 403 of Layer 3 and the rule analytics module 105 of Layer 4 may be provided with either a centralized ledger 103 b or a DMPB module 103 a in Layer 2. A public governance model for Layer 2 data blocks can be used, or a consortium or federated governance model for a centralized ledger can be used so that access to the data is limited, for example, to partners providing data for the fraud and risk models or to partners deploying the risk and fraud models developed by the system.

FIG. 2 illustrates the operation of the management module 103 of Layer 2 in FIG. 1. As the various steps required to build fraud and risk models are completed by the system 100, the completion of each step (e.g., assertions or claims) is written by Layer 3 to a DMPB 103 a or to a centralized ledger 103 b in Layer 2, depending upon a desired governance model. The management module 103 provides all the information necessary to check that fraud, risk and other models are collecting and processing data appropriately as required by the rules (e.g., internal policies and external policies and regulations). As data is accessed by users (e.g., User 1, User 2 . . . User n) within an organization 205, as data is processed to produce models, and as scores produced by models are processed, data and model provenance assertions/claims are written by the IAM module 403 of Layer 3, as also illustrated in FIG. 2.

Whenever data is collected or accessed, appropriate checks are made to ensure that all required conditions and regulations are satisfied, and the appropriate assertions/claims are recorded by Layer 4. Data storage blocks 101 are the smallest, most granular piece of information that is to be stored within the system 100. All data storage blocks 101 are cryptographically bound to the visibility and sharing restrictions in accordance with policies defined by the user, such as in encrypted data block 301. Data storage blocks 101 are then encrypted for processing and persistence using encryption header 302 and encrypted payload 303. In an embodiment, data storage blocks 101 are centralized. In another embodiment, data storage blocks 101 are geographically distributed within all applicable geographic regions to enable high-availability, failover, locality-based speed of response, and consistency of user experience. Visibility of data storage blocks 101 is cryptographically attached to each data storage block 101. Access to these data storage blocks 101 requires the appropriate authorizations for secure data sharing based on user access visibility assessed through a Smart Contract associated with contract checker 204.

Data provenance blocks 102 are a record of all interactions with data storage blocks 101. They also provide an immutable record of how fraud and risk models are built and how they are used to process and to score user data. When a data storage block 101 is created by a user within an organization, the system stores who or what created it, when it was accessed, who or what accessed it, why it was accessed, and where it was used. Provenance blocks are used for patterns of life, attribution, pedigree, and lineage of the data blocks. This is a continuous process of appending immutable transaction details to the data block for its lifetime. Provenance records, unless otherwise prohibited by law or customer policy, are retained for analysis after data blocks are deleted.
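One possible way to realize an append-only record of interactions of this kind is a hash-chained log, in which each record commits to the one before it so that later tampering is detectable. The sketch below is illustrative only: the class name, the field names, and the choice of SHA-256 over a JSON encoding are assumptions, and the disclosure does not prescribe this particular mechanism.

```python
import hashlib
import json


class ProvenanceLog:
    """Append-only log in which each record commits to the record before it."""

    GENESIS = "0" * 64  # placeholder predecessor hash for the first record

    def __init__(self) -> None:
        self.records: list[dict] = []

    @staticmethod
    def _digest(record: dict) -> str:
        # Hash every field except the stored hash itself, in a stable key order.
        body = {key: value for key, value in record.items() if key != "hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, block_id: str, actor: str, action: str, reason: str, when: str) -> dict:
        prev_hash = self.records[-1]["hash"] if self.records else self.GENESIS
        record = {
            "block_id": block_id,  # which data storage block was touched
            "actor": actor,        # who or what touched it
            "action": action,      # e.g. "create" or "read"
            "reason": reason,      # why it was accessed
            "when": when,          # when it was accessed
            "prev_hash": prev_hash,
        }
        record["hash"] = self._digest(record)
        self.records.append(record)
        return record

    def verify(self) -> bool:
        # Recompute each hash and check each link back to its predecessor.
        prev_hash = self.GENESIS
        for record in self.records:
            if record["prev_hash"] != prev_hash or record["hash"] != self._digest(record):
                return False
            prev_hash = record["hash"]
        return True
```

An auditor holding only the log can call `verify()` to confirm that no earlier record has been altered since it was written.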

FIG. 3 illustrates the structure of an embodiment of an encrypted data block 301 such as data storage block 101 or data provenance block 102, which may be utilized in Layer 1 of FIG. 1. Encrypted data block 301 includes an encryption header 302. The encryption header 302 contains the information necessary so that each encrypted data block 301 may be part of a distributed block-based storage system that may include a plurality of data storage blocks 101 and data provenance blocks 102. The encrypted data block 301 also includes an encrypted payload 303, whose encrypted key is provided in the encryption header 302.

The encrypted payload 303 is comprised of two parts: 1) a crypto header 303 a, which contains provenance related information, and 2) the associated payload 305. The crypto header 303 a contains a cryptographic signature that is used to verify the integrity of the data storage block 30, so that it is immutable. This is necessary so that the encrypted data block 301 itself can be audited by the Regulatory and Provenance Analytics of the rule analytics module 105. Finally, the payload 303 b contains the actual data being managed by the encrypted data block 301. This may include the original data and/or provenance information about the data generated by the system 100. The payload 305 may contain several different types of data, including, but not limited to: data collected for analysis by the system; cleaned, aggregated, and transformed data that are inputs to analytic models; the outputs of analytic models, which may be the inputs of other analytic models that are part of an analytic workflow; scores produced by analytic models or analytic workflows; analytic models themselves, in a serialized or other format so that they can be stored in one or more data storage blocks 101; and rules that are used for post-processing the outputs of analytic models and analytic workflows before those outputs are passed to other external interfaces and components. These rules are also in a serialized or other format so that they can be stored in one or more data storage blocks 101.
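The two-part layout described above, a crypto header carrying provenance information and a signature alongside the data itself, may be sketched as follows. This is a simplified illustration under stated assumptions: a keyed HMAC-SHA256 tag stands in for whatever cryptographic signature an implementation uses, encryption of the payload is omitted entirely, and all class and function names are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class CryptoHeader:
    provenance: str  # provenance related information
    signature: str   # integrity tag over the provenance and the data


@dataclass(frozen=True)
class Payload:
    crypto_header: CryptoHeader
    data: bytes      # the actual data being managed by the block


def sign(key: bytes, provenance: str, data: bytes) -> str:
    # HMAC-SHA256 is an assumed stand-in for the block's cryptographic signature.
    message = provenance.encode() + b"\x00" + data
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def make_payload(key: bytes, provenance: str, data: bytes) -> Payload:
    return Payload(CryptoHeader(provenance, sign(key, provenance, data)), data)


def verify_payload(key: bytes, payload: Payload) -> bool:
    # Any change to the data or the provenance text invalidates the stored tag.
    expected = sign(key, payload.crypto_header.provenance, payload.data)
    return hmac.compare_digest(expected, payload.crypto_header.signature)
```

A verifier that recomputes the tag can detect any modification made after the block was written, which is the property the audit step relies on.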

Creation of provenance records. Provenance records in data provenance blocks 102 are created by the system for a number of different reasons and purposes, including, but not limited to, when a new data storage block 101 is created, updated, or deleted. In an embodiment, data is only deleted or changed when required by the rules, such as regulations or policy. Data is immutable, and changes to data are made by appending the changes to the current state of the data, or by using another mechanism for creating and maintaining immutable data, so that there is a complete audit chain of all changes to the data. Provenance records are also created under one or more of the following conditions: when data storage blocks 101 are accessed by any user or system process; or when a policy requirement or a regulation changes the access rules for data. A regulation change may be, for example, that provenance records can be hidden after a requirement to purge data following a request for the right to be forgotten.
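The append-based handling of changes described above may be illustrated with a small sketch in which creations, updates, and deletions are all recorded as new entries and the current state is derived from the history. The class name, the tuple layout of a change, and the treatment of deletion as a marker entry are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AppendOnlyBlock:
    block_id: str
    # Each change is (action, value, actor); earlier entries are never rewritten.
    changes: list[tuple[str, Any, str]] = field(default_factory=list)

    def record(self, action: str, value: Any, actor: str) -> None:
        if action not in {"create", "update", "delete"}:
            raise ValueError(f"unknown action: {action}")
        self.changes.append((action, value, actor))

    def current(self) -> Any:
        # The current state is the most recent value, or None once deleted.
        if not self.changes or self.changes[-1][0] == "delete":
            return None
        return self.changes[-1][1]
```

Because nothing is overwritten, the full list of changes remains available as an audit chain even after the current state reports the data as deleted.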

Returning to FIG. 2, the provenance & data block manager 201 continuously evaluates data storage blocks 101 for changes in customer policy and enforces regulatory modifications required for access to the data storage blocks 101. If a data block policy is updated for any reason, the update is tracked via data provenance block 102 for later analysis. The provenance & data block manager 201 is also adapted and configured to provide auditing and precision data deletions periodically or as desired. Centralized ledger 202 is an immutable storage mechanism for the provenance & data block manager 201. Centralized ledger 202 supports continuous auditing and transparency for the life of provenance block 102 and data storage block 101. The encryption manager 203 provides functionality to associate inbound data storage block 101 and provenance block 102 with the appropriate encryption tokens and supports alignment of user and/or process access through the contract checker 204 (e.g., Smart Contract Controller Gateway) or to the encryption tokens required for access to the data within the data storage block 101. The contract checker 204 is the entry point from the organization 205 to the management module 103. It facilitates authentication and authorization for each user or process that is to be granted access. It is the policy determination point which verifies the user and/or process identity, location, and access privileges.
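The three checks attributed to the contract checker above (identity, location, and access privileges) may be sketched as a single decision function. Everything in this sketch is an assumption made for illustration: the request and policy structures, the use of a shared credential string for identity, and the returned reason strings.

```python
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    principal: str   # user or process asking for access
    credential: str  # assumed proof of identity
    location: str    # where the request originates
    privilege: str   # e.g. "read" or "write"


@dataclass(frozen=True)
class Policy:
    credentials: dict             # principal -> expected credential
    allowed_locations: frozenset  # locations from which access is permitted
    privileges: dict              # principal -> set of granted privileges


def check_request(request: AccessRequest, policy: Policy) -> tuple[bool, str]:
    # 1) Identity: compare credentials in constant time.
    expected = policy.credentials.get(request.principal)
    if expected is None or not hmac.compare_digest(expected.encode(), request.credential.encode()):
        return False, "identity not verified"
    # 2) Location.
    if request.location not in policy.allowed_locations:
        return False, "location not permitted"
    # 3) Access privileges.
    if request.privilege not in policy.privileges.get(request.principal, set()):
        return False, "privilege not granted"
    return True, "granted"
```

Returning a reason with each decision makes it straightforward to write the outcome of every check to a provenance record.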

FIG. 4 illustrates how the logging module 104 logs all relevant data. The logging module 104 logs both user access 406 and system access 405. Both are authenticated and authorized using the identity, authorization and access management (IAM) module 403. Based upon the IAM module 403, reads and writes are permitted on the data storage block 101 and data provenance block 102. The rules analytics module 105 accesses the data provenance blocks 102 to check that the appropriate policies and regulations are enforced. Note, data storage block 101 and data provenance block 102 are not part of the logging module 104, as can be seen in FIG. 4.

FIG. 5 illustrates how smart contracts allow users and organizations to verify directly that internal entity policies and third-party regulations are being followed, without an entity's participation or the participation from other third parties. For example, as illustrated in FIG. 5, organization 205 (e.g., user group, entity, system process etc.) can use a contract checker 204 to access encrypted data blocks 301. As long as the user or system request has access to the relevant data, as determined by the IAM module 403, the provenance data within the encrypted data block 301 may be accessed and analyzed by the rules analytics module 105, 801, with the results returned to the requester. This is because the necessary provenance data has been stored in immutable provenance blocks 102 by the model and analytics logging module 402, as shown in FIG. 7.

FIG. 6 shows how the fine-grained identity and access management (IAM) is handled by the IAM module 403 for both user access 406 and system access 405 to data storage block 101 and data provenance block 102. User and system identity is first authenticated with the authentication services 603. Once authenticated, the access to a particular field of data storage block 101 or data provenance block 102 is provided by the authorization service 604. Data may be from multiple data sources 605 a-n, but access is provided to precisely the data source 605 a-n and to precisely only the authorized fields within the data sources as needed. In other words, the authorization to access data is fine grained or, what is sometimes called cell based, and authorization is not provided to entire datasets, unless the user or system process is authorized to access the entire dataset. The data and provenance blocks are encrypted as shown in FIG. 3 and only decrypted when the user or system is authorized to access a particular field or fields.
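The authenticate-then-authorize sequence with field-level (cell based) access described above may be sketched as follows. The session table, the grant table keyed by principal and data source, and the function names are hypothetical; decryption of the authorized fields is omitted.

```python
from typing import Optional


def authenticate(token: str, sessions: dict[str, str]) -> Optional[str]:
    """Map a session token to a principal, or None if the token is unknown."""
    return sessions.get(token)


def read_record(
    token: str,
    source: str,
    record: dict,
    sessions: dict[str, str],
    grants: dict[tuple[str, str], set[str]],
) -> dict:
    """Return only the fields of the record that the caller is authorized to see."""
    principal = authenticate(token, sessions)
    if principal is None:
        raise PermissionError("authentication failed")
    allowed = grants.get((principal, source), set())
    if not allowed:
        raise PermissionError("no authorized fields in this data source")
    # Field-level filter: unauthorized fields are never returned.
    return {name: value for name, value in record.items() if name in allowed}
```

Access to an entire dataset is then simply the special case in which the grant lists every field of the source.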

FIG. 7 is a flow chart of a method for model and analytic scoring, logging and auditing according to an embodiment of the present invention. Provenance records are kept of the workflows used for the following steps. Step 701 collects data from multiple sources 605 a-n. Step 702 cleans and normalizes the data. Step 703 computes features (e.g., features for aggregating or transforming data). Step 704 uses the features as inputs to models and workflows. Step 705 uses the scores produced by the models and workflows in step 704 as inputs to the post-processing rules to produce the final scores and other outputs. The processing steps 701 through 705 may be implemented or expressed in different ways.

One of the common implementations is to express each workflow as a directed acyclic graph (DAG), in which each node of the graph is a software program or application that is called, and in which a directed edge between two nodes indicates how the outputs from one node are used as the inputs to another node. Each software program or application is labeled with a unique label and available in an environment or framework that allows its execution. For example, in an embodiment, the software program or application may be in a Docker container or other container, which provides a virtualized environment that encapsulates software applications and all the required libraries and configuration files. Alternately, in another implementation, the software program or application may be part of a serverless framework. In this context, a container is a packaging of software and the necessary software libraries and configuration files so that the container may be run using a cloud-computing platform as a service execution model that uses virtualization to: i) support the execution of programs within containers, and ii) provide the ability of containers to communicate with other containers, as specified in appropriate configuration files. In this context, a serverless framework is another cloud-computing execution model in which the cloud service provider runs the server or servers executing the software code, and dynamically manages the allocation of machine resources required to run the server or servers.
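A workflow expressed as a DAG of labeled nodes may be sketched as follows, using the `graphlib.TopologicalSorter` class from the Python standard library (Python 3.9 or later) to obtain an execution order in which every node runs after the nodes it depends on. The node labels, the toy node functions, and the `run_workflow` helper are hypothetical and stand in for containerized or serverless programs.

```python
from graphlib import TopologicalSorter

# Each label maps to the set of upstream labels whose outputs it consumes.
EDGES = {
    "clean": {"collect"},
    "features": {"clean"},
    "model": {"features"},
    "post": {"model"},
}

# Toy stand-ins for the programs at each node; each takes a list of upstream outputs.
NODES = {
    "collect": lambda inputs: [3, None, 5],
    "clean": lambda inputs: [x for x in inputs[0] if x is not None],
    "features": lambda inputs: {"mean": sum(inputs[0]) / len(inputs[0])},
    "model": lambda inputs: inputs[0]["mean"] / 10,
    "post": lambda inputs: round(inputs[0] * 100),
}


def run_workflow(nodes: dict, edges: dict) -> tuple[list, dict]:
    """Run every node after its upstream nodes; return the order used and all outputs."""
    order = list(TopologicalSorter(edges).static_order())
    outputs = {}
    for label in order:
        upstream = [outputs[u] for u in sorted(edges.get(label, ()))]
        outputs[label] = nodes[label](upstream)
    return order, outputs
```

The returned execution order and per-node outputs are the kind of information that can be written to a provenance record for the workflow.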

In this way, each node in each workflow corresponding to a software program or application is assigned a unique label and this information is persisted in an immutable provenance block 102. In addition, each workflow is assigned a unique label and also persisted in immutable provenance blocks. In this way, provenance information persists in the provenance blocks 102 capturing the data source 605 n and the processing workflow steps 701, 702, . . . , 705. This enables the logging module 402 to associate an immutable provenance record with each score or other output produced by the fraud and risk analytics module 106 of FIG. 1. Given these provenance records, the rules module 105 can review the compliance of the scoring either record by record as the scoring is done, or periodically by examining batches of scored data and their associated provenance records. As an alternative implementation, a single provenance record that contains the totality of information characterizing the data source 605 n and the processing 701, 702, . . . , 705 can be associated and used to provide the immutable provenance information for each single score or for a batch of scores produced by the fraud and risk analytics system 106. In some implementations, the clean and normalized data from step 702 is used directly by the analytic models and workflows of step 704, and features need not necessarily be computed in step 703. This is the case, for example, with deep learning models.
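The association of a provenance record with each score, and a batch review over those records, may be sketched as follows. The record fields, the use of a SHA-256 digest as the record identifier, and the notion of a set of approved workflow labels are assumptions made for illustration.

```python
import hashlib
import json


def make_provenance_record(workflow_label: str, node_labels: list[str], data_source: str) -> dict:
    """Build a record of the workflow, its nodes, and the data source, keyed by its own digest."""
    body = {"workflow": workflow_label, "nodes": node_labels, "data_source": data_source}
    record_id = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "record_id": record_id}


def attach(score: float, record: dict) -> dict:
    """Associate a score with the provenance record that describes how it was produced."""
    return {"score": score, "provenance_id": record["record_id"]}


def noncompliant(scored: list[dict], records: dict[str, dict], approved_workflows: set[str]) -> list[dict]:
    """Batch review: return the scores whose workflow is not in the approved set."""
    return [s for s in scored if records[s["provenance_id"]]["workflow"] not in approved_workflows]
```

The same lookup can be applied to a single score at scoring time, which corresponds to the record by record review, or to a stored batch, which corresponds to the periodic review.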

FIG. 8 illustrates software modules for checking that an entity's internal policies and external policies required by users, data suppliers, and third-party regulatory agencies, industry best practices, and others are being supported. For example, a particular user may not have given permission for his or her historical purchases to be used to score fraud and risk models. In this case, when the module 802 developed a fraud or risk model using data available in the datamart 702, module 801 would verify that this user's historical purchases, or data derived from this user's historical purchases, would not be used as part of the training data from datamart 702 by module 802 to build any models. In addition, as part of an audit analysis of historical scores, provenance blocks 102 could be analyzed to be sure that the user's historical purchases were not used to build any models. Similarly, provenance records 102 could be analyzed for any user of interest to check whether any model deployed by module 803 used data from the user being audited, to ensure this data was consistent with the user's preferences at that time as provided by the user data 401, and as recorded in the relevant immutable data provenance block 101. In this way, audit reports 806 can be produced over batches of data, as well as alerts 804 and real-time monitoring records 805 for individual records being scored, as provided by the model and analytics logging data 402.
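By way of illustration only, the audit described above may be sketched as a check of provenance records against user permissions. This is a minimal sketch under stated assumptions: the record layout, the purpose names ("build_model", "score"), the consent mapping, and the function find_violations are hypothetical. As a simplification, the sketch checks a single snapshot of each user's preferences, rather than the preferences in force at the time each record was created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    record_id: str        # identifier of the provenance record
    purpose: str          # hypothetical purpose name, e.g. "build_model" or "score"
    user_ids: frozenset   # users whose data contributed to this activity


def find_violations(records, consent):
    """Return (record_id, user_id, purpose) for each use a user did not permit.

    `consent` maps a user id to the set of purposes that user has permitted;
    a user missing from the mapping is treated as having permitted nothing.
    """
    violations = []
    for rec in records:
        for user in sorted(rec.user_ids):
            if rec.purpose not in consent.get(user, set()):
                violations.append((rec.record_id, user, rec.purpose))
    return violations


if __name__ == "__main__":
    records = [
        ProvenanceRecord("r1", "build_model", frozenset({"u1", "u2"})),
        ProvenanceRecord("r2", "score", frozenset({"u2"})),
    ]
    consent = {"u1": {"score"}, "u2": {"build_model", "score"}}
    for violation in find_violations(records, consent):
        print("violation:", violation)
```

Run over a batch of records, the returned list could feed an audit report; applied to a single record as it is scored, a non-empty result could raise an alert.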

The subject matter described herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of nonvolatile memory, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto optical disks; and optical disks (e.g., CD and DVD disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

The subject matter described herein can be implemented in a computing system that includes a back end component (e.g., a data server), a middleware component (e.g., an application server), or a front end component (e.g., a client computer, mobile device, or wearable device having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back end, middleware, and front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.

Although the disclosed subject matter has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the disclosed subject matter may be made without departing from the spirit and scope of the disclosed subject matter, which is limited only by the claims which follow.

What is claimed is:
 1. A system for data sharing with a plurality of users and for checking rule compliance, the system comprising: at least one processor; a plurality of data storage blocks stored in a block-based storage system, wherein data included in the plurality of data storage blocks are used in at least one of building an analytic model and an analytic workflow; a plurality of data provenance blocks stored in the block-based storage system, wherein at least one of the plurality of data provenance blocks stores provenance data related to building one of an analytic model and an analytic workflow or related to using one of an analytic model and an analytic workflow to score data and produce outputs; a management module coupled to the block-based storage system and configured to run on one of the at least one processor to create and read the data storage blocks and the data provenance blocks of the block-based storage system; and a logging module coupled to the management module, and configured to run on one of the at least one processor to manage at least one of logging of using at least a portion of the data stored in the data storage blocks to build an analytic model or an analytic workflow, and logging of the using of at least a portion of the data stored in the data storage blocks to score that data using one of an analytic model and an analytic workflow to produce outputs, wherein the logging module is further configured to store the logging data in the data provenance blocks of the block-based storage system.
 2. The system of claim 1, further comprising a rules analytic module coupled to the logging module and configured to run on one of the at least one processor to read the created data provenance blocks, and check each data provenance block for compliance with each of a plurality of rules.
 3. The system of claim 2, wherein the plurality of rules are at least one of data collection rules, data sharing rules, privacy rules and regulatory requirements.
 4. The system of claim 2, wherein the plurality of users are all from the same organization or the same consortium of organizations.
 5. The system of claim 4, wherein the rules analytic module is configured to enable at least one of the plurality of users from within the organization or the consortium of organizations to check whether data collection is compliant with at least one of the plurality of rules.
 6. The system of claim 4, wherein the rules analytic module is configured to provide at least one of the plurality of users within the organization or the consortium of organizations with confirmation of whether data collection is compliant with at least one of the plurality of rules.
 7. The system of claim 4, wherein the rules analytic module is configured to enable at least one of the plurality of users within the organization or the consortium of organizations to check whether data collection is compliant with at least one of the plurality of rules.
 8. The system of claim 4, wherein the rules analytic module is configured to provide at least one of the plurality of users with confirmation of whether at least one of access to and processing of the data blocks by the system, data sharing, building of models, use of scores and other outputs from models and workflows is consistent with at least one of the plurality of rules.
 9. The system of claim 4, wherein the rules analytic module is configured to provide at least one of the plurality of users with confirmation of whether at least one of access to and processing of the data blocks by the system, data sharing, building of models, use of scores and other outputs from models and workflows is compliant with at least one of the plurality of rules.
 10. The system of claim 2, wherein the rules analytic module is configured so that at least one of the plurality of users is outside of an organization or consortium of organizations.
 11. The system of claim 2, wherein the rules analytic module is configured to provide at least one of the plurality of users outside of the organization or the consortium of organizations with confirmation of whether at least one of access to and processing of the data by the system, data sharing, building of models, use of scores and other outputs from models and workflows is consistent with at least one of the plurality of rules.
 12. The system of claim 10, wherein the rules analytic module is configured to provide at least one of the plurality of users outside of the organization or the consortium of organizations with confirmation of whether at least one of access to and processing of the data by the system, data sharing, building of models, use of scores and other outputs from models and workflows is compliant with at least one of the plurality of rules.
 13. The system of claim 10, wherein the rules analytic module is configured to enable at least one of the plurality of users outside of the organization or the consortium of organizations to check whether data collection is compliant with at least one of the plurality of rules.
 14. The system of claim 10, wherein the rules analytic module is configured to provide at least one of the plurality of users outside the organization or the consortium of organizations with confirmation of whether data collection is compliant with at least one of the plurality of rules.
 15. The system of claim 2, wherein the rules analytic module is configured to check that access to and processing of the data storage blocks by the system is compliant with the at least one of the plurality of rules.
 16. The system of claim 2, wherein the rules analytic module is configured to check whether at least one of data sharing and building of an analytic model is compliant with at least one of the plurality of rules.
 17. The system of claim 1, wherein the management module includes a centralized ledger.
 18. The system of claim 1, wherein the management module is adapted and configured to use a block chain.
 19. The system of claim 1, wherein a log of access to the data blocks by the system is saved to the management module.
 20. The system of claim 1, wherein said data provenance blocks comprise immutable records of interactions with said data storage blocks.